{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tv6vx7wooDfk"
   },
   "source": [
    "# ✨ Quantize & Finetune an SLM with Olive\n",
    "\n",
    "> ⚠️ **This notebook will quantize an Small Language Model (SLM) using the AWQ algorithm, which requires an Nvidia A10 or A100 GPU device.**\n",
    "\n",
    "In this notebook, you will:\n",
    "\n",
    "1. Quantize Llama-3.2-1B-Instruct model using the [AWQ Algorithm](https://ar5iv.labs.arxiv.org/html/2306.00978).\n",
    "1. Fine-tune the quantized model to classify English phrases into Surprise/Joy/Fear/Sadness.\n",
    "1. Optimize the fine-tuned model for the ONNX Runtime.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🐍 Install Python dependencies\n",
    "\n",
    "The following cells create a pip requirements file and then install the libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile requirements.txt\n",
    "\n",
    "olive-ai==0.7.1\n",
    "transformers==4.44.2\n",
    "autoawq==0.2.6\n",
    "optimum==1.23.1\n",
    "peft==0.13.2\n",
    "accelerate>=0.30.0\n",
    "scipy==1.14.1\n",
    "onnxruntime-genai==0.6.0\n",
    "torchvision==0.18.1\n",
    "tabulate==0.9.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ZtY3VYxCoDfm"
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "\n",
    "%pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🤗 Login to Hugging Face\n",
    "\n",
    "In this notebook you'll be finetuning [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), which is *gated* on Hugging Face and therefore you will need to request access to the model. Once you have access to the model, you'll need to log-in to Hugging Face with a [user access token](https://huggingface.co/docs/hub/security-tokens) so that Olive can download it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!huggingface-cli login --token USER_ACCESS_TOKEN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🗜️ Quantize the model using AWQ\n",
    "First, you'll quantize the [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) model using the [AWQ Algorithm](https://ar5iv.labs.arxiv.org/html/2306.00978). Olive also supports other quantization algorithms, such as GPTQ, HQQ, and RTN.\n",
    "\n",
    "You can choose a different model to quantize from Hugging-Face, just update the `--model_name_or_path` argument.\n",
    "> ⏳ **It takes approximately ~6mins to complete the AWQ quantization**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!olive quantize \\\n",
    "    --model_name_or_path \"meta-llama/Llama-3.2-1B-Instruct\" \\\n",
    "    --trust_remote_code \\\n",
    "    --algorithm awq \\\n",
    "    --output_path models/llama/awq \\\n",
    "    --log_level 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nxJCT5wioDfp"
   },
   "source": [
    "## 🏃 Train the model\n",
    "\n",
    "Fine-tuning language models helps when we desire very specific outputs. In this example, you'll fine-tune the **AWQ quantized model variant** of Llama-3.2-1B-instruct from the previous cell to respond to an English phrase with a single word answer that classifies the phrases into one of surprise/fear/joy/sadness categories. Here is a sample of the data used for fine-tuning:\n",
    "\n",
    "```jsonl\n",
    "{\"phrase\": \"The sudden thunderstorm caught me off guard.\", \"tone\": \"surprise\"}\n",
    "{\"phrase\": \"The creaking door at night is quite spooky.\", \"tone\": \"fear\"}\n",
    "{\"phrase\": \"Celebrating my birthday with friends is always fun.\", \"tone\": \"joy\"}\n",
    "{\"phrase\": \"Saying goodbye to my pet was heart-wrenching.\", \"tone\": \"sadness\"}\n",
    "```\n",
    "\n",
    "Fine-tuning *after* quantization provides an opportunity to recover some of the loss from the quantization process and enhance the model quality. For more details on quantization and finetuning, read [Is it better to quantize before or after finetuning?](https://onnxruntime.ai/blogs/olive-quant-ft).\n",
    "\n",
    "In the following `olive finetune` command the `--data_name` argument is a Hugging Face dataset [xxyyzzz/phrase_classification](https://huggingface.co/datasets/xxyyzzz/phrase_classification). You can also provide your own data from local disk using the `--data_files` argument.\n",
    "\n",
    "> ⏳ **It takes ~6mins to complete the Finetuning**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "8t36pRF2oDfq"
   },
   "outputs": [],
   "source": [
    "!olive finetune \\\n",
    "    --method lora \\\n",
    "    --model_name_or_path models/llama/awq \\\n",
    "    --trust_remote_code \\\n",
    "    --data_name xxyyzzz/phrase_classification \\\n",
    "    --text_template \"<|start_header_id|>user<|end_header_id|>\\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n{tone}<|eot_id|>\" \\\n",
    "    --max_steps 300 \\\n",
    "    --output_path models/llama/ft \\\n",
    "    --log_level 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7woNXLDF0bhh"
   },
   "source": [
    "## 🪄 Automatic model optimization with Olive\n",
    "\n",
    "Next, you'll execute Olive's automatic optimizer using the `auto-opt` CLI command, which will:\n",
    "\n",
    "1. Capture the fine-tuned model into an ONNX graph and convert the weights into the ONNX format.\n",
    "1. Optimize the ONNX graph (e.g. fuse nodes, reshape, etc).\n",
    "1. Extract the fine-tuned LoRA weights and place them into a separate file.\n",
    "\n",
    "> ⏳**It takes ~2mins for the automatic optimization to complete**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "M-prKBy20U5m"
   },
   "outputs": [],
   "source": [
    "!olive auto-opt \\\n",
    "    --model_name_or_path models/llama/ft/model \\\n",
    "    --adapter_path models/llama/ft/adapter \\\n",
    "    --device cpu \\\n",
    "    --provider CPUExecutionProvider \\\n",
    "    --use_ort_genai \\\n",
    "    --output_path models/llama/onnx-ao \\\n",
    "    --log_level 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8Uwm432loDfr"
   },
   "source": [
    "## 🧠 Inference\n",
    "\n",
    "The code below creates a test app that consumes the model in a simple console chat interface. You will be prompted to enter an English phrase (for example: \"Cricket is a wonderful game\") and the app will output a chat completion using:\n",
    "\n",
    "1. The base model only (no adapter). You should notice that the model gives a verbose response.\n",
    "1. The base model **plus adapter**. You should notice that we get one word classification. \n",
    "\n",
    "In the code, you'll  notice that ONNX Runtime allows you to hot-swap adapters for different tasks, which is often referred to as *multi-LoRA* serving.\n",
    "\n",
    "Whilst the inference code uses the Python API for the ONNX Runtime, other language bindings are available in [Java, C#, C++](https://github.com/microsoft/onnxruntime-genai/tree/main/examples).\n",
    "\n",
    "To exit the chat interface, enter `exit` or select `Ctrl+c`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "puMdoAxjoDfr"
   },
   "outputs": [],
   "source": [
    "import onnxruntime_genai as og\n",
    "\n",
    "model_path = \"models/llama/onnx-ao/model\"\n",
    "\n",
    "model = og.Model(f'{model_path}')\n",
    "adapters = og.Adapters(model)\n",
    "adapters.load(f'{model_path}/adapter_weights.onnx_adapter', \"classifier\")\n",
    "tokenizer = og.Tokenizer(model)\n",
    "tokenizer_stream = tokenizer.create_stream()\n",
    "\n",
    "# Keep asking for input prompts in a loop\n",
    "while True:\n",
    "    phrase = input(\"Phrase: \")\n",
    "    prompt = f\"<|start_header_id|>user<|end_header_id|>\\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n\"\n",
    "    input_tokens = tokenizer.encode(prompt)\n",
    "    \n",
    "    # first run without the adapter\n",
    "    params = og.GeneratorParams(model)\n",
    "    params.set_search_options(past_present_share_buffer=False)\n",
    "    generator = og.Generator(model, params)\n",
    "    generator.append_tokens(input_tokens)\n",
    "\n",
    "    print()\n",
    "    print(\"Output from Base Model (notice verbosity): \", end='', flush=True)\n",
    "\n",
    "    while not generator.is_done():\n",
    "            generator.generate_next_token()\n",
    "\n",
    "            new_token = generator.get_next_tokens()[0]\n",
    "            print(tokenizer_stream.decode(new_token), end='', flush=True)\n",
    "    print()\n",
    "    print()\n",
    "    \n",
    "    # Delete the generator to free the captured graph for the next generator, if graph capture is enabled\n",
    "    del generator\n",
    "    \n",
    "     # now run with adapter\n",
    "    generator = og.Generator(model, params)\n",
    "    # set the adapter to active for this response\n",
    "    generator.set_active_adapter(adapters, \"classifier\")\n",
    "    generator.append_tokens(input_tokens)\n",
    "\n",
    "    print()\n",
    "    print(\"Output from Base Model + Adapter (notice single word response): \", end='', flush=True)\n",
    "\n",
    "    while not generator.is_done():\n",
    "            generator.generate_next_token()\n",
    "\n",
    "            new_token = generator.get_next_tokens()[0]\n",
    "            print(tokenizer_stream.decode(new_token), end='', flush=True)\n",
    "    print()\n",
    "    print()\n",
    "    del generator"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "A100",
   "provenance": [],
   "toc_visible": true
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
